9 research outputs found

    Extreme State Aggregation Beyond MDPs

    Full text link
    We consider a Reinforcement Learning setup where an agent interacts with an environment in observation-reward-action cycles without any (esp. MDP) assumptions on the environment. State aggregation, and more generally feature reinforcement learning, is concerned with mapping histories/raw states to reduced/aggregated states. The idea behind both is that the resulting reduced process (approximately) forms a small stationary finite-state MDP, which can then be efficiently solved or learnt. We considerably generalize existing aggregation results by showing that even if the reduced process is not an MDP, the (q-)value functions and (optimal) policies of an associated MDP with the same state-space size solve the original problem, as long as the solution can approximately be represented as a function of the reduced states. This implies an upper bound on the required state-space size that holds uniformly for all RL problems. It may also explain why RL algorithms designed for MDPs sometimes perform well beyond MDPs. (Comment: 28 LaTeX pages, 8 theorems.)
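
    As an illustration of the setup, the sketch below (not the paper's construction) pairs a hypothetical aggregation map phi, which maps raw histories to a small set of reduced states, with plain Q-value iteration on an MDP estimated over those reduced states; the history format and the particular choice of phi are assumptions made for the example.

        import numpy as np

        def solve_aggregated_mdp(P, R, gamma=0.95, iters=1000):
            # P: (S, A, S) estimated transitions of the reduced process,
            # R: (S, A) estimated expected rewards over the reduced states.
            S, A = R.shape
            Q = np.zeros((S, A))
            for _ in range(iters):
                V = Q.max(axis=1)
                Q = R + gamma * (P @ V)   # Bellman optimality backup on the reduced MDP
            return Q

        def phi(history):
            # Hypothetical aggregation map: keep only the latest observation
            # (histories are assumed to be lists of (observation, reward, action) tuples).
            obs, _reward, _action = history[-1]
            return obs

        def act(Q, history):
            # Act greedily with respect to the reduced-state q-values.
            return int(Q[phi(history)].argmax())

    Acting greedily on Q over phi(history) is the kind of reduced-state policy whose near-optimality the paper characterizes, even when the reduced process itself is not an MDP.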

    Making 'informed choices' in antenatal care

    Get PDF
    In the Bayesian approach to sequential decision making, exact calculation of the (subjective) utility is intractable. This extends to most special cases of interest, such as reinforcement learning problems. While utility bounds are known to exist for this problem, so far none of them have been particularly tight. In this paper, we show how to efficiently calculate a lower bound, which corresponds to the utility of a near-optimal memoryless policy for the decision problem; this policy is generally different from both the Bayes-optimal policy and the policy that is optimal for the expected MDP under the current belief. We then show how these bounds can be applied to obtain robust exploration policies in a Bayesian reinforcement learning setting. (Comment: Corrected version. 12 pages, 3 figures, 1 table.)
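
    The basic fact behind such a bound is that the expected value of any fixed memoryless policy, averaged over MDPs drawn from the current belief, cannot exceed the Bayes-optimal utility. The sketch below illustrates only this fact; sample_mdp (a draw of (P, R) from the posterior), pi, and the tabular shapes are assumed interfaces, and the paper's method for actually finding a near-optimal memoryless policy is not reproduced here.

        import numpy as np

        def policy_value(P, R, pi, gamma=0.95):
            # Exact value of a fixed memoryless policy pi on a single known MDP.
            # P: (S, A, S) transitions, R: (S, A) rewards, pi: (S, A) action probabilities.
            S = R.shape[0]
            P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state transitions under pi
            R_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
            return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

        def utility_lower_bound(sample_mdp, pi, start_state=0, n_samples=200, gamma=0.95):
            # Monte-Carlo estimate of E_{M ~ belief}[V^pi_M(start_state)]; since the
            # Bayes-optimal policy does at least as well as any fixed memoryless policy,
            # this average is a lower bound on the Bayes-optimal utility.
            values = [policy_value(*sample_mdp(), pi=pi, gamma=gamma)[start_state]
                      for _ in range(n_samples)]
            return float(np.mean(values))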

    Bounded parameter Markov decision processes with average reward criterion

    No full text
    Bounded parameter Markov Decision Processes (BMDPs) address the issue of dealing with uncertainty in the parameters of a Markov Decision Process (MDP). Unlike the case of an MDP, the notion of an optimal policy for a BMDP is not entirely straightforward. We consider two notions of optimality based on optimistic and pessimistic criteria. These have been analyzed for discounted BMDPs. Here we provide results for average reward BMDPs. We establish a fundamental relationship between the discounted and the average reward problems, prove the existence of Blackwell optimal policies and, for both notions of optimality, derive algorithms that converge to the optimal value function.
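
    To make the optimistic criterion concrete, the sketch below shows interval value iteration for the previously analyzed discounted case (the paper's average-reward algorithms are not reproduced): each backup picks, within the given transition-probability intervals, the successor distribution most favourable to the agent. P_lo, P_hi and R are hypothetical interval bounds and rewards supplied by the user.

        import numpy as np

        def optimistic_transition(p_lo, p_hi, V):
            # Pick a distribution within the componentwise bounds [p_lo, p_hi] that
            # maximizes the expected next value p @ V (assumes sum(p_lo) <= 1 <= sum(p_hi)).
            p = p_lo.copy()
            budget = 1.0 - p.sum()
            for s in np.argsort(-V):                # favour high-value successors first
                add = min(p_hi[s] - p_lo[s], budget)
                p[s] += add
                budget -= add
                if budget <= 1e-12:
                    break
            return p

        def interval_value_iteration(P_lo, P_hi, R, gamma=0.95, iters=500):
            # Optimistic value iteration for a discounted BMDP.
            S, A, _ = P_lo.shape
            V = np.zeros(S)
            for _ in range(iters):
                Q = np.empty((S, A))
                for s in range(S):
                    for a in range(A):
                        p = optimistic_transition(P_lo[s, a], P_hi[s, a], V)
                        Q[s, a] = R[s, a] + gamma * (p @ V)
                V = Q.max(axis=1)
            return V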

    Sample Complexity Bounds of Exploration

    No full text
    Efficient exploration is widely recognized as a fundamental challenge inherent in reinforcement learning. Algorithms that explore efficiently converge faster to near-optimal policies. While heuristic techniques are popular in practice, they lack formal guarantees and may not work well in general. This chapter studies algorithms with polynomial sample complexity of exploration, both model-based and model-free ones, in a unified manner. These so-called PAC-MDP algorithms behave near-optimally except in a “small” number of steps with high probability. A new learning model known as KWIK is used to unify most existing model-based PAC-MDP algorithms for various subclasses of Markov decision processes. We also compare the sample-complexity framework to alternatives for formalizing exploration efficiency, such as regret minimization and Bayes-optimal solutions.
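
    A representative member of this family is R-MAX, a model-based algorithm with polynomial sample complexity of exploration. The class below is a minimal sketch of the underlying idea rather than any specific algorithm from the chapter; the visit threshold m, the reward bound Rmax, and the tabular interface are assumptions made for illustration.

        import numpy as np

        class RMaxAgent:
            # Minimal R-MAX-style sketch: a state-action pair counts as "known" after
            # m visits; unknown pairs keep the optimistic value Rmax / (1 - gamma).
            def __init__(self, S, A, m=10, Rmax=1.0, gamma=0.95):
                self.m, self.Rmax, self.gamma = m, Rmax, gamma
                self.counts = np.zeros((S, A), dtype=int)
                self.trans = np.zeros((S, A, S))
                self.rew = np.zeros((S, A))
                self.Q = np.full((S, A), Rmax / (1.0 - gamma))

            def act(self, s):
                return int(self.Q[s].argmax())

            def update(self, s, a, r, s2):
                if self.counts[s, a] < self.m:
                    self.counts[s, a] += 1
                    self.trans[s, a, s2] += 1.0
                    self.rew[s, a] += r
                    if self.counts[s, a] == self.m:   # the pair just became known: replan
                        self._replan()

            def _replan(self, iters=500):
                known = self.counts >= self.m
                n = np.maximum(self.counts, 1)
                P, R = self.trans / n[..., None], self.rew / n
                opt = self.Rmax / (1.0 - self.gamma)  # optimistic value of unknown pairs
                for _ in range(iters):
                    V = self.Q.max(axis=1)
                    self.Q = np.where(known, R + self.gamma * (P @ V), opt)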

    A robust method for the synthesis and isolation of β-gluco-isosaccharinic acid ((2R,4S)-2,4,5-trihydroxy-2-(hydroxymethyl)pentanoic acid) from cellulose and measurement of its aqueous pKa

    Get PDF
    In alkaline pulping, wood pulp is reacted with concentrated aqueous alkali at elevated temperatures. In addition to producing cellulose for the manufacture of paper, alkaline pulping also generates large amounts of isosaccharinic acids as waste products. Isosaccharinic acids are potentially useful raw materials: they are good metal-chelating agents and, in their enantiomerically pure form, they are valuable carbon skeletons with predefined stereochemistry that can easily be functionalised for use in synthesis. Despite this, there is no simple procedure for isolating pure β-(gluco)isosaccharinic acid, and very limited work has been undertaken to determine the chemical and physical properties of this compound. We report here a very simple but effective method for the synthesis of a mixture containing equal portions of the two isosaccharinic acids ((2S,4S)-2,4,5-trihydroxy-2-(hydroxymethyl)pentanoic acid and (2R,4S)-2,4,5-trihydroxy-2-(hydroxymethyl)pentanoic acid) and the separation of the two as their tribenzoate esters. We also report for the first time the aqueous pKa of β-(gluco)isosaccharinic acid (3.61).
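
    For reference, the reported value can be converted to an acid dissociation constant via the textbook definition pKa = -log10 Ka (a standard relation, not a result of the paper):

        K_a = 10^{-\mathrm{p}K_a} = 10^{-3.61} \approx 2.5 \times 10^{-4}

    At pH 7 the ratio [A-]/[HA] = 10^(7-3.61) ≈ 2500, so the acid is essentially fully deprotonated in near-neutral aqueous solution.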

    Preference-Based Policy Learning

    Get PDF
    Many machine learning approaches in robotics, based on reinforcement learning, inverse optimal control or direct policy learning, critically rely on robot simulators. This paper investigates simulator-free direct policy learning, called Preference-based Policy Learning (PPL). PPL iterates a four-step process: the robot demonstrates a candidate policy; the expert ranks this policy relative to other ones according to her preferences; these preferences are used to learn a policy return estimate; the robot uses the policy return estimate to build new candidate policies, and the process is iterated until the desired behavior is obtained. PPL requires that a good representation of the policy search space be available, enabling one to learn accurate policy return estimates and limiting the human ranking effort needed to yield a good policy. Furthermore, this representation cannot use informed features (e.g., how far the robot is from any target) due to the simulator-free setting. As a second contribution, this paper proposes a representation based on the agnostic exploitation of the robotic log. The convergence of PPL is analytically studied, and its experimental validation on two problems, involving a single robot in a maze and two interacting robots, is presented.
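
    The four-step loop can be sketched as follows. The callbacks demonstrate, ask_expert and propose, the initial policy, and the linear logistic ranking surrogate used as the policy return estimate are all assumptions made for illustration; the paper's own representation and estimator are not reproduced.

        import numpy as np

        def learn_return_estimate(features, prefs, lr=0.1, epochs=200):
            # Fit a linear utility w on per-policy feature vectors from pairwise
            # preferences (i preferred over j) with a logistic ranking loss.
            w = np.zeros(features.shape[1])
            for _ in range(epochs):
                for i, j in prefs:
                    d = features[i] - features[j]
                    p = 1.0 / (1.0 + np.exp(-w @ d))
                    w += lr * (1.0 - p) * d           # gradient ascent on the log-likelihood
            return w

        def ppl_loop(initial_policy, demonstrate, ask_expert, propose, n_iters=20):
            # demonstrate(policy) -> feature vector built from the robot's log;
            # ask_expert(new_index, archive) -> list of (preferred, dispreferred) index pairs;
            # propose(w, archive) -> new candidate policy maximizing the learned estimate.
            archive, feats, prefs = [], [], []
            policy = initial_policy
            for _ in range(n_iters):
                feats.append(demonstrate(policy))                  # 1. demonstrate the candidate
                archive.append(policy)
                prefs += ask_expert(len(archive) - 1, archive)     # 2. expert ranks it
                w = learn_return_estimate(np.array(feats), prefs)  # 3. learn the return estimate
                policy = propose(w, archive)                       # 4. build a new candidate
            return archive, w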

    Neuroblastoma and Related Tumors

    No full text